For this part, you'll be using PySpark to process tweets, i.e., messages posted on Twitter.
This Jupyter notebook contains: (i) instructions to set up the programming environment, and (ii) the programming problems. You will use this notebook to write your code and submit it for grading.
To run your code, you will need to connect to a Jupyter server maintained by CSC. Please read the following instructions very carefully.
The server will restart after 15 minutes of inactivity.
Therefore, if you have made changes to the notebook but want to pause your work, make sure you download the notebook to your computer. You can upload it again when you're ready to resume your work.
What to do if the server restarts while inactive: to save the code you wrote, copy and paste it manually into a text file on your computer. You can then follow the main steps described above again.
You will work with two datasets:
When working with the big dataset, you will use a bigger cluster than the one used for the small dataset. To see how many other students are using it and whether resources are available, visit this webpage: https://86.50.170.119/resources.
If you receive a message that says that the server is full, allow 15-20 minutes before you try to access it again.
Attention! You must develop and run your code before the day of the deadline. We cannot guarantee support for any failures that happen on that day.
Run the following cells once before you start working on your solutions.
Note: If you attempt to create a SparkContext twice, you will get an error.
In [ ]:
## Imports and SparkContext
import pyspark
import numpy as np # use numpy for advanced numeric operations
import ast # we'll use this module to read data
In [ ]:
## SMALL dataset
## Run this cell *ONLY* if you want to be
## using the SMALL dataset
SMALL_FILE = "file:///data/small.input"
DATA_FILE = SMALL_FILE
In [ ]:
## BIG dataset
## Run this cell *ONLY* if you want to be
## using the BIG dataset
BIG_FILE = "hdfs:///moderndb/input_2.5m.input"
## Environment parameters for the big cluster
## Use them ONLY if you're working with the big dataset
import os
os.environ["PYSPARK_PYTHON"]="/opt/conda/bin/python3"
os.environ["SPARK_HOME"]="/usr/hdp/current/spark-client"
os.environ["HDP_VERSION"]="current"
DATA_FILE = BIG_FILE
In [ ]:
sc = pyspark.SparkContext()
load_func = sc.textFile
In [ ]:
data = load_func(DATA_FILE)
Most lines in the file contain a string representation of a Python dictionary that represents a tweet. A tweet is a message posted on Twitter. Below is an example of one tweet (one line in the file).
{u'contributors': None, u'coordinates': None, u'created_at': u'Fri Sep 26 08:01:38 +0000 2014', u'entities': {u'hashtags': [], u'symbols': [], u'trends': [], u'urls': [{u'display_url': u'bbc.in/1qBcZ4G', u'expanded_url': u'http://bbc.in/1qBcZ4G', u'indices': [126, 140], u'url': u'http://t.co/8kPs90syqo'}], u'user_mentions': [{u'id': 2190056023, u'id_str': u'2190056023', u'indices': [3, 9], u'name': u'BBC Outside Source', u'screen_name': u'BBCOS'}]}, u'favorite_count': 0, u'favorited': False, u'filter_level': u'medium', u'geo': None, u'id': 515410856616935425, u'id_str': u'515410856616935425', u'in_reply_to_screen_name': None, u'in_reply_to_status_id': None, u'in_reply_to_status_id_str': None, u'in_reply_to_user_id': None, u'in_reply_to_user_id_str': None, u'lang': u'en', u'place': None, u'possibly_sensitive': False, u'retweet_count': 0, u'retweeted': False, u'retweeted_status': {u'contributors': None, u'coordinates': None, u'created_at': u'Fri Sep 26 07:44:43 +0000 2014', u'entities': {u'hashtags': [], u'symbols': [], u'trends': [], u'urls': [{u'display_url': u'bbc.in/1qBcZ4G', u'expanded_url': u'http://bbc.in/1qBcZ4G', u'indices': [115, 137], u'url': u'http://t.co/8kPs90syqo'}], u'user_mentions': []}, u'favorite_count': 0, u'favorited': False, u'filter_level': u'low', u'geo': None, u'id': 515406597829693440, u'id_str': u'515406597829693440', u'in_reply_to_screen_name': None, u'in_reply_to_status_id': None, u'in_reply_to_status_id_str': None, u'in_reply_to_user_id': None, u'in_reply_to_user_id_str': None, u'lang': u'en', u'place': None, u'possibly_sensitive': False, u'retweet_count': 3, u'retweeted': False, u'source': u'<a href="http://www.socialflow.com" rel="nofollow">SocialFlow</a>', u'text': u"EU's anti-terrorism chief tells BBC the number of Europeans joining Islamist fighters in Syria and Iraq now 3,000+ http://t.co/8kPs90syqo", u'truncated': False, u'user': {u'contributors_enabled': False, u'created_at': u'Tue Nov 12 10:17:10 +0000 2013', u'default_profile': 
False, u'default_profile_image': False, u'description': u'Real-time reports from inside the BBC newsroom with @BBCRosAtkins. Combining your sources and ours. BBC World Service radio 10GMT, BBC World News TV 17GMT.', u'favourites_count': 734, u'follow_request_sent': None, u'followers_count': 16500, u'following': None, u'friends_count': 750, u'geo_enabled': False, u'id': 2190056023, u'id_str': u'2190056023', u'is_translator': False, u'lang': u'en-gb', u'listed_count': 340, u'location': u'London', u'name': u'BBC Outside Source', u'notifications': None, u'profile_background_color': u'131516', u'profile_background_image_url': u'http://abs.twimg.com/images/themes/theme14/bg.gif', u'profile_background_image_url_https': u'https://abs.twimg.com/images/themes/theme14/bg.gif', u'profile_background_tile': True, u'profile_banner_url': u'https://pbs.twimg.com/profile_banners/2190056023/1399384694', u'profile_image_url': u'http://pbs.twimg.com/profile_images/421268342767222784/17yQM0_d_normal.jpeg', u'profile_image_url_https': u'https://pbs.twimg.com/profile_images/421268342767222784/17yQM0_d_normal.jpeg', u'profile_link_color': u'009999', u'profile_sidebar_border_color': u'EEEEEE', u'profile_sidebar_fill_color': u'EFEFEF', u'profile_text_color': u'333333', u'profile_use_background_image': True, u'protected': False, u'screen_name': u'BBCOS', u'statuses_count': 6602, u'time_zone': u'London', u'url': u'http://www.bbc.co.uk/programmes/p01k2bx3', u'utc_offset': 3600, u'verified': True}}, u'source': u'<a href="http://twitter.com/#!/download/ipad" rel="nofollow">Twitter for iPad</a>', u'text': u"RT @BBCOS: EU's anti-terrorism chief tells BBC the number of Europeans joining Islamist fighters in Syria and Iraq now 3,000+ http://t.co/8\u2026", u'timestamp_ms': u'1411718498766', u'truncated': False, u'user': {u'contributors_enabled': False, u'created_at': u'Sun Nov 13 20:46:00 +0000 2011', u'default_profile': True, u'default_profile_image': False, u'description': u'The more I do nothing, 
the less time I have to do anything', u'favourites_count': 27, u'follow_request_sent': None, u'followers_count': 216, u'following': None, u'friends_count': 747, u'geo_enabled': True, u'id': 411752020, u'id_str': u'411752020', u'is_translator': False, u'lang': u'en', u'listed_count': 1, u'location': u'', u'name': u'Gerald Quinlan', u'notifications': None, u'profile_background_color': u'C0DEED', u'profile_background_image_url': u'http://abs.twimg.com/images/themes/theme1/bg.png', u'profile_background_image_url_https': u'https://abs.twimg.com/images/themes/theme1/bg.png', u'profile_background_tile': False, u'profile_image_url': u'http://pbs.twimg.com/profile_images/1661078970/image_normal.jpg', u'profile_image_url_https': u'https://pbs.twimg.com/profile_images/1661078970/image_normal.jpg', u'profile_link_color': u'0084B4', u'profile_sidebar_border_color': u'C0DEED', u'profile_sidebar_fill_color': u'DDEEF6', u'profile_text_color': u'333333', u'profile_use_background_image': True, u'protected': False, u'screen_name': u'QuinlanQuinlan', u'statuses_count': 9223, u'time_zone': None, u'url': None, u'utc_offset': None, u'verified': False}}
Note that some of the lines might be corrupt -- i.e., they do not contain a tweet, but other information, e.g., a logging message. It is your responsibility to deal with corrupt lines and make sure they do not affect your computation.
If tweet_string is the string representation of one tweet, you can load it into a Python dictionary with the following statement.
tweet = ast.literal_eval(tweet_string)
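Since some lines are corrupt (see the note below), you may want to wrap this call in error handling. A minimal sketch, using a hypothetical helper named `parse_tweet` that treats any line that is not a valid dictionary literal as corrupt:

```python
import ast

def parse_tweet(line):
    """Parse one line into a tweet dictionary.

    Returns None for corrupt lines, i.e., lines that are not a valid
    string representation of a Python dictionary (e.g., logging messages).
    """
    try:
        tweet = ast.literal_eval(line)
        # a valid line must evaluate to a dictionary, not some other literal
        return tweet if isinstance(tweet, dict) else None
    except (ValueError, SyntaxError):
        return None
```

One way to use it in a pipeline is, e.g., `data.map(parse_tweet).filter(lambda t: t is not None)`, which yields an RDD of tweet dictionaries with corrupt lines dropped.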
Use PySpark to write and execute queries for the following tasks (1 - 10). Generate one cell per task.
You are free to reuse (possibly persisted) RDDs and other code from earlier tasks.
Tasks
Find out how many lines (tweets or corrupt) there are in the data.
Inspect any three tweets and infer the tweet schema from them, as completely as you can.
Each tweet is identified by its unique id value. Some tweets might appear in more than one line in the data (i.e., the same tweet id might appear on more than one line). How many tweets appear in only one line?
Each tweet is associated with a lang value that denotes the language the tweet was written in. How many different languages are there in the dataset?
How many lines are there in the data for each language? (Ignore tweet ids for this task).
How many tweets are there in the data for the English language (lang = 'en')?
Each tweet is associated with a text value that stores the content of the message. What are the minimum and maximum message lengths in the data? Note: you should compute both values in a single pass over the data.
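To illustrate the single-pass idea (not a task solution): both values can be tracked in one accumulator of the form (min, max). The combiner logic is shown below in plain Python; with PySpark, the same two functions could be passed to, e.g., RDD.aggregate.

```python
# Neutral element for the (min, max) accumulator: any real length
# is smaller than +inf and larger than -inf.
INITIAL = (float('inf'), float('-inf'))

def seq_op(acc, length):
    # fold one message length into the running (min, max)
    return (min(acc[0], length), max(acc[1], length))

def comb_op(a, b):
    # merge two partial (min, max) results, e.g., from different partitions
    return (min(a[0], b[0]), max(a[1], b[1]))
```

With an RDD of message lengths, a call shaped like `lengths.aggregate(INITIAL, seq_op, comb_op)` would return both extremes in a single pass.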
Consider the text message that appears in a tweet. We define a word to be a maximal sequence of alphanumeric characters found in the text, after the text has been converted to lowercase. For example, if the text is
My username is spark123 & my password is spar!.Kk
then the words contained in it are the following.
my, username, is, spark123, my, password, is, spar, kk
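This word definition can be sketched as a small helper using a regular expression over the lowercased text; the pattern `[a-z0-9]+` assumes only ASCII alphanumerics count, which matches the example above.

```python
import re

def words(text):
    """Return the words of a message: maximal runs of alphanumeric
    characters in the lowercased text."""
    return re.findall(r'[a-z0-9]+', text.lower())
```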
Find the 1000 most frequent words in the text messages of all English tweets (lang = 'en'), along with the number of their occurrences.
Find the 100 most frequent words in English tweets (lang = 'en') that start with each character of the Latin alphabet ('a', 'b', ..., 'z'). Use the lowercase version of the messages.
Each tweet is associated with a user who generated it, and the user is associated with a unique screen_name. Find the screen_name of the $10$ users with the most English tweets (with distinct tweet ids) in the data.
Note: Twitter users are also associated with an id value, different from the tweet id. Make sure you do not confuse the two.
In [ ]:
## 1
# Your code goes here
Use this cell to explain / expand your answer.
In [ ]:
## 2
# Your code goes here
Use this cell to explain / expand your answer.
In [ ]:
## 3
# Your code goes here
Use this cell to explain / expand your answer.
In [ ]:
## 4
# Your code goes here
Use this cell to explain / expand your answer.
In [ ]:
## 5
# Your code goes here
Use this cell to explain / expand your answer.
In [ ]:
## 6
# Your code goes here
Use this cell to explain / expand your answer.
In [ ]:
## 7
# Your code goes here
Use this cell to explain / expand your answer.
In [ ]:
## 8
# Your code goes here
Use this cell to explain / expand your answer.
In [ ]:
## 9
# Your code goes here
Use this cell to explain / expand your answer.
In [ ]:
## 10
# Your code goes here
Use this cell to explain / expand your answer.
Some tweets are replies to tweets written by other Twitter users. You can tell which tweets are replies by checking the in_reply_to_screen_name value of a tweet: if it is None, the tweet is not a reply; otherwise, it is a reply to the user with the screen_name mentioned therein.
For example, in_reply_to_screen_name: None signifies that the tweet is not a reply to another tweet, while in_reply_to_screen_name: northern_bytes signifies that the tweet is a reply to a tweet generated by user northern_bytes.
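As an illustration of this check (a hypothetical helper, assuming an edge connects the author of a reply to the user being replied to):

```python
def reply_edge(tweet):
    """Return a (replier, replied_to) pair of screen_names for a reply
    tweet, or None if the tweet is not a reply."""
    target = tweet.get('in_reply_to_screen_name')
    if target is None:
        return None
    return (tweet['user']['screen_name'], target)
```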
We are going to construct a graph in the following steps:
Construct a pair RDD named linksRDD to store the adjacency list of each node. Specifically, for each node (user) $u$ in the graph, you should have one element of the following form:
(screen_name, [screen_name_1, screen_name_2, ..., screen_name_v, ...])
where screen_name corresponds to user $u$ and screen_name_v corresponds to a user $v$ that $u$ is connected to.
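The grouping step has the same shape as collecting edge pairs into per-node lists; shown below in plain Python for illustration (with PySpark, an RDD of (u, v) pairs can be grouped with, e.g., `edges.groupByKey().mapValues(list)`).

```python
def adjacency_lists(edges):
    """Group (u, v) edge pairs into a mapping from each node u
    to the list of nodes v it is connected to."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    return adj
```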
In [ ]:
# Your code goes here
How many nodes and edges are there in the graph?
In [ ]:
# Your code goes here
In [ ]:
## Pagerank implementation
# Your code goes here
Report the $5$ nodes with the highest PageRank values.
In [ ]:
# Your code goes here
We are grateful to:
If the file is updated (e.g., to add a clarification), a short description of the update will be listed here, as well as on mycourses.
In [ ]: